In this document we summarise the data about the school system and the state of school infrastructure that we currently have at our disposal. Our main source is the Italian Ministry of Education, University and Research (MIUR), which directly provides most of the data we present here through an open-access webpage.

Another openly accessible source is the “Scuola in chiaro” portal, through which it is possible to browse individual schools. The information displayed through this portal comes both from the aforementioned MIUR Open Data and from additional data that the schools are enabled to provide based on another database controlled by the MIUR, namely the Information System of Education (SIDI).

All Italian schools are identified through a unique 10-character ID called the mechanographical code. Its first two characters denote the province and the following two characters identify the type of school.
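As a minimal illustration, the two components described above can be extracted with base R's `substr()`. The helper name `parse_mech_code` is hypothetical and introduced only for this sketch; the meaning of the remaining six characters is not covered here.

```r
# Hypothetical helper: split a mechanographical code into the two parts
# described in the text (province = characters 1-2, school type = characters 3-4).
parse_mech_code <- function(code) {
  stopifnot(is.character(code), all(nchar(code) == 10))
  data.frame(
    code        = code,
    province    = substr(code, 1, 2),
    school_type = substr(code, 3, 4),
    stringsAsFactors = FALSE
  )
}

# Code of the reference institute used as an example in the next section
parsed <- parse_mech_code("BAIS033007")
print(parsed)
```

The same helper can be applied to a whole column of codes, since `substr()` is vectorised.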

School anagraphics

The school anagraphics database (available here) includes both public and private schools, but we consider only the former because the school buildings dataset (next section) does not include private schools. Data are detailed at the level of single schools, and each school is mapped to its reference institute (i.e. one institute may serve more than one venue). Each school is in turn located in a physical building. Building IDs are taken from the school buildings dataset, which has been joined to this one via the schools’ mechanographical code.

Here we show an example of the hierarchy among reference institute, school, and school building. A single school may be made up of several distinct buildings, and one building may host more than one school. The top node is the reference institute (BAIS033007), a Superior Institute in this case, as the characters “IS” in its code indicate; the intermediate nodes are the single schools, identified via the mechanographical code; the bottom nodes are the school buildings. Since a school building is a physical entity, it is natural to attach a spatial location to it. Indeed, the initial digits of the building ID are the ID code of the municipality in which the building is located. In this example we have two schools located in the municipality of Casamassima (BA, municipality code = 72015) and five other schools distributed among three buildings located in the municipality of Acquaviva delle Fonti (BA, municipality code = 72001). To summarise, the reference institute heads 7 different high schools, which are located in a total of 4 buildings.

example.scheme()

As can be seen, there is no one-to-one relationship between the school and the physical building. In order to have as many observed territorial units as possible, the target variable (i.e. the Invalsi score) has been detailed at the level of the municipality in which buildings are located rather than at the level of the venue of the reference institute. Therefore we only need to aggregate the school buildings dataset at the municipality level (for province-level data the issue is simpler, since data are observed in each province and all school buildings belong to the same province as the reference institute).

Comprehensive Institutes

Here the link between the mechanographical code and the education grade is briefly summarised. A major issue arises when it is not possible to assign a single school to a given education grade (i.e. primary school, middle school and high school) by means of the mechanographical codes.
This is due to two reasons. On the one hand, some types of schools do not follow the same rules as the ordinary ones (e.g. religious schools; these are not relevant to our analysis, so in this case the problem does not arise). On the other hand, single schools classified as Comprehensive Institutes (code IC, henceforth referred to with this code) cannot be linked to a specific education grade.
In addition, there may be a duplication issue: Comprehensive Institutes (as reference institutes) include a number of different schools belonging to different grades. For each set of schools belonging to one reference Comprehensive Institute, besides the single schools classified in detail as primary, middle, or high schools of a specific type (e.g. scientific or classical high schools, art institutes, technology institutes, professional institutes), one may also find a single school classified as a Comprehensive Institute. In this case no education grade can be assigned to the school, nor is there any information about whether the mechanographical code identifies a standalone school with its own building or whether the record is simply a duplicate. Hence the problem with ICs is twofold:

  • How to assign the proper grade to the data (included in the school buildings DB) of these records?
  • Is there a physical structure to which these data are referred, or should these rows be counted as duplicates?

Most schools classified as ICs have a physical building ID that appears more than once (e.g. 4816 out of 5103 for year 2021-2022), which means that in these cases the same information is repeated both for a properly identifiable school and for a school classified as IC:

# 2022 data
table(ifelse(ICs22$CODICE_EDIFICIO %in% doub.buildings22$CODICE_EDIFICIO, "Duplicated", "Unique") )
## 
## Duplicated     Unique 
##       4816        287
# 2023 data
table(ifelse(ICs23$CODICE_EDIFICIO %in% doub.buildings23$CODICE_EDIFICIO, "Duplicated", "Unique") )
## 
## Duplicated     Unique 
##       4837        289
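The objects `ICs22` and `doub.buildings22` are not defined in this document. The following is only a sketch, on made-up toy data, of one plausible way such a check could be built: building codes occurring in more than one row are collected first, then ICs are flagged against them. The column name `CODICE_SCUOLA` and all codes below are assumptions made for illustration.

```r
# Toy school-building table; every code here is invented for illustration.
toy <- data.frame(
  CODICE_SCUOLA   = c("XXIC000001", "XXEE000002", "XXMM000003", "XXIC000004"),
  CODICE_EDIFICIO = c("B001", "B001", "B002", "B003"),
  stringsAsFactors = FALSE
)

# Building codes that occur in more than one row
# (assumed meaning of the doub.buildings objects used above).
counts <- table(toy$CODICE_EDIFICIO)
doub_buildings <- data.frame(
  CODICE_EDIFICIO = names(counts)[counts > 1],
  stringsAsFactors = FALSE
)

# ICs are identified by characters 3-4 of the school code.
ICs <- toy[substr(toy$CODICE_SCUOLA, 3, 4) == "IC", ]

# Same flagging logic as in the calls shown above.
status <- ifelse(ICs$CODICE_EDIFICIO %in% doub_buildings$CODICE_EDIFICIO,
                 "Duplicated", "Unique")
print(table(status))
```

In this toy table the first IC shares building `B001` with another school, while the second IC sits alone in `B003`.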

For certain ICs whose building code is unique in the DB, some physical variables still appear to be redundant with those of other buildings sharing the same reference institute (e.g. an IC has the same school volume as another building among the pertaining schools). Should this aspect be deemed worth inspecting, here we link to an interactive table displaying some physical variables (physical measures and address) of ICs whose building code is unique. The file must be opened through its Google Drive address for interactive selection to work. Please note that this is a completely open-access file: all data displayed in it are public-domain information, downloaded for free from openly accessible web pages.

Superior Institutes

A similar problem occurs with the category of Superior Institutes (ISs). ISs are often the reference institute for a number of high schools, hence schools with mechanographical codes including “IS” are certainly high schools; the only problem is the duplication issue we have just discussed for ICs.

Similarly to ICs, most ISs have a building code which appears also for other schools:

# 2022 Data
table(ifelse(ISs22$CODICE_EDIFICIO %in% doub.buildings22$CODICE_EDIFICIO, "Duplicated", "Unique") )
## 
## Duplicated     Unique 
##       1731        155
# 2023 Data
table(ifelse(ISs23$CODICE_EDIFICIO %in% doub.buildings23$CODICE_EDIFICIO, "Duplicated", "Unique") )
## 
## Duplicated     Unique 
##       1733        149

As for ICs, here it is possible to visualize all schools referring to a Superior Institute with a unique building code.

Given the ambiguity of these records, in the absence of further information about the physical nature of schools classified as ICs or ISs we have decided not to include them in the covariates DB which is discussed in the section below.

School buildings data

This is the main database of our analysis (link). To the best of our knowledge this is by far the most important source of information about the territorial distribution of school infrastructure. It only includes public schools. Data are detailed at the level of physical buildings and single schools, hence each row is a unique combination of these two IDs. Most variables are Boolean.

Schools need to be grouped either by education grade and municipality or by education grade and province. The major issue lies in the handling of undefined records, i.e. all responses coded as “NON DEFINITO” (“undefined”). Some variables have a number of undefined observations of the order of \(10^3\) or \(10^4\); they cannot be used as covariates in the forthcoming analysis and are therefore ruled out. Since by default we delete every row displaying a missing observation, we need to balance the column cutout against the row cutout (e.g. without any column cutout we would have to filter out more than 20,000 rows, and vice versa). Hence the need for a numerical threshold: a column is ruled out when it has more than 1000 missing observations.
Moreover, we rule out some variables that are simply of scarce interest or cannot be aggregated at the province or municipality level. These same operations may be repeated for any other available year other than 2021-2022.
Only when we have solved each ambiguity caused by missingness can we aggregate our records at the municipality or province level.
In the Appendix we display how we have chosen to clean the dataset and aggregate data.
As an example of how the final dataframes are structured, we show the percentage of high schools served by urban public transport in the Apulia region for the school year 2021-2022. The areas shown as not available are the municipalities in which no high school is located.

Map_covar1(Year = 2022, nfield = 31, level = "Municipality", region_code = 16, plot = "Mapview", pal = "Blues", col.rev = FALSE, type ="Superiore")
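The map relies on municipality-level percentages of a Boolean variable. The sketch below, on invented toy data, shows one way such a percentage could be computed with base R and why municipalities without schools end up as not available. The column names `municipality` and `served`, and the municipality code 72999, are hypothetical; the two other codes are those quoted in the example above.

```r
# Toy data: one row per high school, with a logical flag for public transport.
schools <- data.frame(
  municipality = c("72001", "72001", "72001", "72015"),
  served       = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Percentage of served schools in each municipality.
pct <- aggregate(served ~ municipality, data = schools,
                 FUN = function(x) 100 * mean(x))
names(pct)[2] <- "pct_served"

# Full list of municipalities to be mapped; "72999" is a made-up code
# standing for a municipality with no high school.
all_muni <- data.frame(municipality = c("72001", "72015", "72999"),
                       stringsAsFactors = FALSE)

# A left join keeps every municipality; those without schools get NA.
merged <- merge(all_muni, pct, by = "municipality", all.x = TRUE)
print(merged)
```

The `NA` produced by the left join is what a mapping function would typically render as a not available area.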

Here we provide a link to a complete mapping of all the variables included in the final dataframes:

Year Province Municipality (Apulia region)
2021-2022
2022-2023

Number of students

As for the school anagraphics, the catalogue of data regarding the number of students includes both public and private schools, but given that our main concern is the school buildings data, only public schools are considered here. Only 2021-2022 data are currently available on the MIUR website. In the “Scuola in Chiaro” portal the detail about the number of students per year of study is available only for the current year, namely 2023-2024, while for years 2021-2022 and 2022-2023 we only have the total number of students per school. Here we have chosen to employ the former source, namely the MIUR Open Data. Available datasets for public schools are the following:

  1. Number of students by school, school year and age. Includes 28327 schools.
  2. Number of students by school, school year and nationality status (i.e. Italian, EU citizens and non-EU citizens). Includes 28327 schools.
  3. Number of students and classes by school, school year and gender. Includes 28327 schools.
  4. Number of students by school, school year, high school type and gender. Includes 6423 schools.
  5. Number of students by school, school year, gender and school running time (i.e. full-time or normal time for primary schools, and normal time or music track, “indirizzo musicale”, for middle schools). Includes 21904 schools.

The first three datasets are perhaps the most interesting ones in that they can be directly linked to the main DB. All these DBs are detailed by the school code and the school year. However, while the combinations of school code and year are the same for DBs 1) and 3), DB 2) lacks 4120 of these combinations.

Here we summarise the number of schools for which the number of students is available:

MIUR_StudNumAvailability %>% group_by(TIPO) %>%
  summarise(DISPONIBILE = sum(DISPONIBILE_NUMERO_STUDENTI),
            NON_DISPONIBILE = n() - sum(DISPONIBILE_NUMERO_STUDENTI))

As can be seen, the number of students is completely unavailable for Comprehensive Institutes and for Superior Institutes, yet the frequency of missing records for the clearly classified schools is relatively low (4.79% for high schools, 1.83% for elementary schools, 3.59% for elementary schools). In the absence of information provided by the MIUR Open Data we have tried to resort to the “Scuola in Chiaro” portal, but for most of these schools there is no relevant information on this site either. As a consequence, we cannot include them in the covariates file.
Among the five datasets listed above, we are going to employ dataset 3), which also details the number of classes.

As stated in the previous sections, we ultimately need to detail data at the municipality level (or at the province level, which is easier). However, the student number data are detailed by school only, with no information about the physical buildings, and some schools (98 schools in 2022, of which 52 are within the scope of this research) are served by several buildings located in different municipalities. Therefore we need a criterion to map each school unambiguously to one municipality. This can be accomplished by relying on the school anagraphics dataset (first section), in which each row corresponds to one school and the municipality of the main physical structure of that school is provided. Based on this information we have aggregated this dataset at the municipality and province level.
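The mapping criterion described above can be sketched as a join followed by a sum. This is a base R illustration on invented toy data, not the actual processing code; the column names `school`, `n_students` and `municipality`, as well as the school IDs, are hypothetical.

```r
# Toy student counts, one row per school.
students <- data.frame(
  school     = c("S1", "S2", "S3"),
  n_students = c(120, 80, 200),
  stringsAsFactors = FALSE
)

# Toy anagraphics: exactly one row per school, carrying the municipality
# of the school's main physical structure.
anagraphics <- data.frame(
  school       = c("S1", "S2", "S3"),
  municipality = c("72001", "72001", "72015"),
  stringsAsFactors = FALSE
)

# Each school inherits a single municipality from the anagraphics...
joined <- merge(students, anagraphics, by = "school", all.x = TRUE)

# ...so student counts can be summed by municipality without double counting.
by_muni <- aggregate(n_students ~ municipality, data = joined, FUN = sum)
print(by_muni)
```

Because the anagraphics table has one row per school, the join cannot duplicate a school that is served by buildings in several municipalities.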

Number of teachers

This dataset reports the number of teachers, both tenured and substitute, by age class and education grade (i.e. kindergarten, primary school, middle school and high school). Data are detailed only at the province level; it is therefore impossible to map this information to municipalities or single schools. In the following map we show the number of teachers per student across the Italian provinces in the school year 2021-2022:

Prov_shp %>% rename(CODICE_PROVINCIA = COD_PROV ) %>% dplyr::select(CODICE_PROVINCIA) %>%
  left_join(filter(Docenti_per_alunno_statale_2022, ORDINE_SCUOLA == "SCUOLA SECONDARIA II GRADO"),
            by = "CODICE_PROVINCIA") %>% 
  mapview(zcol = "Docenti_per_alunno", popup = paste0(set.popup.height(220), leafpop::popupTable(.)),
          col.regions = grDevices::hcl.colors(nrow(.)-3, palette = "Blues")) 

However, information for school year 2022-2023 can be accessed through the “Scuola in Chiaro” portal, where it is possible to browse single schools by school name. Once all the data necessary to build a dataset referring uniquely to year 2022-2023 are available, this dataset will be added as well. The extraction of 2022-2023 data is made possible by the tabular structure in which they are displayed, since tables can be identified within the HTML code of a webpage. Here we link to the scraping function employed.

Appendix

School buildings dataset

Here is an example of how the DB is structured (top slider scrolls columns, bottom slider scrolls rows). This is the raw dataset as it has been downloaded from the MIUR website.

# DB22_MIUR is the DB including all files from the school buildings repository
head(DB22_MIUR)

This table summarises the percentage of missing responses by province, whereas here is a more general summary of the absolute number of missing responses per variable. For more detail on the spatial distribution of missingness, it is possible to check these maps. Only the first 37 variables are included (those which may have a greater influence on school performance and education level).

Year Province Municipality (Apulia region)
2021-2022
2022-2023
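A per-variable count of missing responses of the kind summarised above can be obtained in one line with `colSums()`. The sketch below uses a toy table; the two column names appear in the cleaning log further down, while the “SI”/“NO” coding of the defined answers is an assumption made for illustration.

```r
# Toy extract with two of the fields; values are invented.
toy <- data.frame(
  CONTESTO_SENZA_DISTURBI = c("SI", "NON DEFINITO", "NO", "NON DEFINITO"),
  SCUOLABUS               = c("SI", "NO", "NON DEFINITO", "NO"),
  stringsAsFactors = FALSE
)

# Comparing a data frame with a string yields a logical matrix,
# so column sums give the absolute number of undefined responses.
n_missing   <- colSums(toy == "NON DEFINITO")
pct_missing <- 100 * n_missing / nrow(toy)

print(n_missing)
print(pct_missing)
```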

The missingness pattern is homogeneous: for a high number of fields there are some schools for which almost all data are missing. To illustrate this, here we show the units in which the first variable is missing (305 units for the year 2021-2022, 493 units for 2022-2023). The pattern is stronger for 2021-2022.

DB22_MIUR %>%  filter(CONTESTO_SENZA_DISTURBI == "NON DEFINITO") 
DB23_MIUR %>%  filter(CONTESTO_SENZA_DISTURBI == "NON DEFINITO") 

Given this missingness pattern, we clean the data as follows; we only show the procedure for year 2021-2022.

After deleting some manually selected variables that either cannot be aggregated or are simply of scarce interest (e.g. CAP), we perform a twofold automatic cutout. First, we rule out all the columns with more than 1000 missing observations (argument “col.cut.thresh”); then we filter out all rows in which any value is still missing. As can be seen, after filtering out the rows in which the first variable is not observed, only 6 rows with missing values remain.

Lastly, we clean out the columns which cannot be expressed as Boolean variables.

cutout <- c("CAP", "VICINANZA_ALTRI_DISTURBI","ALTRE_CRITICITA_SPECIFICHE", "TIPOLOGIA_INDIRIZZO","DENOMINAZIONE_INDIRIZZO", "NUMERO_CIVICO", "ALTRO", "ALTRI_ACCORGIMENTI","ALTRO_SPECIFICARE")

DB22_MIUR_clean <- DB22_MIUR %>% Clean_MIUR_DB(startcol = 10, cutout = cutout, col.cut.thresh = 10^3,
                                               pattern_out = "NON DEFINITO")
## [1] "deleted 305 units whose field CONTESTO_SENZA_DISTURBI is missing"
## [1] "deleted 2 units whose field SPAZI_DIDATTICA is missing"
## [1] "deleted 4 units whose field SCUOLABUS is missing"
# We further cut out the non-Boolean variables
DB22_MIUR_bool <- DB22_MIUR_clean %>% dplyr::select(1:36) %>% gsub.bool(startcol = 9)
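The functions `Clean_MIUR_DB` and `gsub.bool` are not defined in this document. The following is a hypothetical re-implementation sketch of the twofold cutout described above, run on toy data with a toy threshold; it is not the authors' code. The function name `clean_sketch`, its arguments and the “SI”/“NO” coding are assumptions.

```r
# Hypothetical sketch of the twofold cutout; NOT the actual Clean_MIUR_DB.
clean_sketch <- function(df, cols, col_cut_thresh, pattern_out = "NON DEFINITO") {
  # Step 1: drop the columns with more than col_cut_thresh undefined values.
  n_undef <- vapply(df[cols], function(x) sum(x == pattern_out), integer(1))
  dropped <- names(n_undef)[n_undef > col_cut_thresh]
  df <- df[, !(names(df) %in% dropped), drop = FALSE]
  # Step 2: column by column, delete the rows that are still undefined.
  for (col in setdiff(cols, dropped)) {
    undefined <- df[[col]] == pattern_out
    if (any(undefined)) {
      message("deleted ", sum(undefined), " units whose field ", col, " is missing")
      df <- df[!undefined, , drop = FALSE]
    }
  }
  df
}

toy <- data.frame(
  id = 1:5,
  A  = c("SI", "NON DEFINITO", "NO", "SI", "NO"),
  B  = c("NON DEFINITO", "NON DEFINITO", "NON DEFINITO", "SI", "NO"),
  C  = c("SI", "NO", "NO", "NON DEFINITO", "SI"),
  stringsAsFactors = FALSE
)

# With a toy threshold of 2, column B (3 undefined values) is dropped,
# then one row is deleted for A and one for C.
cleaned <- clean_sketch(toy, cols = c("A", "B", "C"), col_cut_thresh = 2)

# Convert the remaining answers to logical (assumed "SI"/"NO" coding).
cleaned[c("A", "C")] <- lapply(cleaned[c("A", "C")], function(x) x == "SI")
print(cleaned)
```

Cutting columns before rows means that a variable with many undefined values does not force the deletion of otherwise complete records.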

Now it is possible to aggregate data since no ambiguous record remains. We show the final dataframe detailed at the province level:

DB22_MIUR_prov